PaHCC: Printed and Handwritten Chinese Characters Dataset

1. Introduction

The PaHCC dataset was constructed by the State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA). We selected 1000 frequently used Chinese characters from the GB2312-80 standard as our categories. Building on prior work in the character recognition field, we reorganized and integrated data from two existing databases to construct PaHCC.

The SCUT-SPCCI database contains synthetic printed character images generated from 280 different fonts, stored as files in *.ccb format. CASIA-HWDB1.0-1.1 contains grayscale images segmented and labeled from scanned handwritten pages, stored as files in *.gnt format. We parsed the data of the selected categories from both databases according to their respective format specifications and saved it as *.png images without modifying the original data. The synthetic printed character images are all grayscale and of size 64×64. The handwritten character images are grayscale, have background pixel values set to 0, and vary in size. Figure 1 shows some examples.

PaHCC is a large-scale, comprehensive Chinese character dataset. It can support research on many challenging problems concerning the robustness, transferability, and interpretability of models in visual pattern recognition. Recommended uses include domain generalization, domain adaptation, and structure-understanding models, among others.

Figure 1. Examples from the PaHCC dataset. (a)-(c) show synthetic printed Chinese character images; (d) shows scanned handwritten Chinese character images.
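Because the handwritten images vary in size while the printed images are a fixed 64×64, most pipelines need some size normalization before mixing the two. The sketch below shows one common option, not a step prescribed by PaHCC: center each image on a square canvas filled with the background value 0, so that a later resize does not distort the aspect ratio. The function name pad_to_square is hypothetical, and the image is represented as a plain list of pixel rows to keep the example dependency-free.

```python
def pad_to_square(rows, background=0):
    """Center a grayscale image on a square canvas.

    `rows` is a list of equal-length lists of pixel values.
    `background` defaults to 0, matching the background value of the
    handwritten PaHCC images.
    """
    height = len(rows)
    width = len(rows[0]) if height else 0
    side = max(height, width)

    # Offsets that center the original image on the square canvas.
    top = (side - height) // 2
    left = (side - width) // 2

    canvas = [[background] * side for _ in range(side)]
    for r, row in enumerate(rows):
        canvas[top + r][left:left + width] = row
    return canvas
```

After padding, the square image can be resized to 64×64 with any image library to match the printed samples.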


2. Data Structure and Statistics

PaHCC is our full dataset, which contains 1000 classes and 996,478 samples. The printed data comprises 280,647 synthetic Chinese character images rendered in about 280 printed fonts. We divide them into three domains according to font type: standard printed fonts (domain 0), distorted printed fonts (domain 1), and handwriting-style printed fonts (domain 2). The handwritten data contains 715,831 scanned handwritten Chinese character images from 720 writers. We treat all handwritten data as test data. Figure 1 provides a visual representation of our data partitioning. The dataset is organized in the directory structure /domains/classes/samples, so the path of each sample provides its domain label and class label as ground truth. We present fine-grained statistics of the dataset in Table 1.
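The /domains/classes/samples layout means labels can be read directly from each file path. The following is a minimal sketch of indexing an extracted copy of the dataset and separating the handwritten test data from the printed domains. The helper names and the folder name "handwritten" are assumptions made for illustration; check the actual directory names in the archive and adjust test_domains accordingly.

```python
from collections import Counter
from pathlib import Path


def index_dataset(root, test_domains=("handwritten",)):
    """Return (train, test) lists of (path, domain, class_label) tuples.

    Assumes the layout <root>/<domain>/<class>/<sample>.png.
    `test_domains` holds the hypothetical folder name(s) of the
    handwritten data, which the dataset treats as test data.
    """
    train, test = [], []
    for png in sorted(Path(root).glob("*/*/*.png")):
        # The last three path components are domain, class, and file name.
        domain, class_label = png.parts[-3], png.parts[-2]
        record = (png, domain, class_label)
        if domain in test_domains:
            test.append(record)
        else:
            train.append(record)
    return train, test


def count_by_domain(records):
    """Count samples per domain, e.g. to compare against the published statistics."""
    return Counter(domain for _, domain, _ in records)
```

Calling index_dataset on the extracted root directory and passing each returned list to count_by_domain gives per-domain totals that can be compared with the counts reported for the dataset.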

Table 1. Statistics of the PaHCC dataset.

Given the large scale of the complete dataset (PaHCC), we also provide a smaller dataset of 100 classes (mini-PaHCC) to reduce the computational cost of experiments. Table 2 presents fine-grained statistics of the mini-PaHCC dataset.

Table 2. Statistics of the mini-PaHCC dataset.

3. Data Download

The PaHCC and mini-PaHCC datasets are each packed as a zip archive. Please click the links below to download them.

PaHCC.zip (2.17 GB)

mini-PaHCC.zip (222.3 MB)


Reference

The PaHCC dataset was first used in [1]. If you find this dataset useful, please cite the following paper:

[1] Jiao Zhang, Xiang Ao, Xu-Yao Zhang, Cheng-Lin Liu. Towards Reliable Domain Generalization: Insights from the PF2HC Benchmark and Dynamic Evaluations, Pattern Recognition, 2025.